Skip to content

feat(codex): per-turn watchdog with TurnWatchdogError + exit 124#312

Open
OJamals wants to merge 1 commit into
openai:mainfrom
OJamals:feat/turn-watchdog
Open

feat(codex): per-turn watchdog with TurnWatchdogError + exit 124#312
OJamals wants to merge 1 commit into
openai:mainfrom
OJamals:feat/turn-watchdog

Conversation

@OJamals
Copy link
Copy Markdown

@OJamals OJamals commented May 10, 2026

Problem

runAppServerTurncaptureTurn (plugins/codex/scripts/lib/codex.mjs) awaits state.completion indefinitely. Completion resolves on turn/completed JSON-RPC notifications or on the 250ms scheduleInferredCompletion timer (which only fires after finalAnswerSeen=true). If the JSON-RPC stream stalls entirely — model inference hang, network drop without TCP error, app-server stuck — no notifications arrive, the timer never fires, and the await hangs forever.

--wait controls foreground vs background dispatch only; it is not a runtime bound.

Downstream wrappers currently work around this by wrapping every Codex review call with an external timeout(1) / perl alarm shell guard. That works but is fragile (loses structured error info, only available where the wrapper exists).

Fix

Per-turn heartbeat watchdog at the captureTurn layer:

  1. New exported TurnWatchdogError extends Error with code = 'TURN_WATCHDOG_TIMEOUT' and exitCode = 124 (matching timeout(1) convention).
  2. createTurnCaptureState accepts watchdogMs and tracks watchdogTimer.
  3. captureTurn arms the watchdog on entry, resets it on every notification (heartbeat), disarms in finally.
  4. On watchdog fire: state.rejectCompletion(new TurnWatchdogError(...)), which propagates out of await state.completion and through runAppServerTurn to the caller.
  5. runAppServerTurn reads options.watchdogMs ?? Number(process.env.CODEX_TURN_WATCHDOG_MS) || null and forwards.
  6. codex-companion.mjs main().catch maps TurnWatchdogError → process.exitCode = 124 and writes a structured {error: 'TurnWatchdogTimeout', ...} JSON line to stderr.

Heartbeat semantics

An active turn streaming notifications resets the timer indefinitely. Only true silence triggers abort. Default null preserves existing behavior bit-for-bit. Opt-in via CODEX_TURN_WATCHDOG_MS or per-call options.watchdogMs.

Scenario Watchdog OFF (default) Watchdog 600000 (10min)
Healthy turn, finishes in 5s passes passes
Long but active turn, 8min, notifications every 30s passes passes (heartbeat resets)
Stalled turn, no notifications for 10min hangs forever rejects with TurnWatchdogError, exit 124
Stalled turn, no notifications for 30s hangs forever passes (within window)

Tests

tests/turn-watchdog.test.mjs covers TurnWatchdogError shape: code, exitCode, message, threadId/turnId/watchdogMs metadata, default-null fallbacks. Full node --test tests/*.test.mjs run is green for all suites unaffected by pre-existing darwin tmp-dir flakes (state/runtime tests; same 4 fail on untouched v1.0.4).

captureTurn is internal, so the included unit test exercises the exported error class. A behavioral test would require either exporting captureTurn or wiring a fake JSON-RPC client through runAppServerTurn; happy to extend if maintainers prefer.

Files

  • plugins/codex/scripts/lib/codex.mjs+77/-1
  • plugins/codex/scripts/codex-companion.mjs+15/-1
  • tests/turn-watchdog.test.mjs+31 (new)

Total: +122/-2. No public API changes; one new export (TurnWatchdogError).

Context

This was discovered while investigating stalled /codex:review runs in a downstream Claude Code integration. Full investigation notes + design doc:

  • Investigation: stalled review run, executeReviewRun blocked on state.completion with no diagnostic.
  • Workaround in use today: external perl alarm 600s wrapper.
  • Goal of this PR: native per-turn watchdog so the wrapper can be retired and structured TurnWatchdogError info propagates to callers.

Adds an opt-in heartbeat watchdog at the captureTurn layer that aborts a turn when no JSON-RPC notification arrives within a configurable window. Default behavior preserved: opt-in via CODEX_TURN_WATCHDOG_MS env or per-call options.watchdogMs.

Problem: runAppServerTurn -> captureTurn awaits state.completion indefinitely. The 250ms scheduleInferredCompletion timer only fires after finalAnswerSeen=true. When the JSON-RPC stream stalls entirely (model inference hang, network drop without TCP error, app-server stuck), no notifications arrive, the timer never fires, and the await hangs forever. --wait controls foreground vs background dispatch only; it is not a runtime bound.

Fix: TurnWatchdogError carries code='TURN_WATCHDOG_TIMEOUT' + exitCode=124. createTurnCaptureState accepts watchdogMs. captureTurn arms the watchdog before startRequest, kicks (resets) it on every notification, and disarms in finally. On fire, the watchdog rejects state.completion with TurnWatchdogError, propagating through await + finally to the caller. codex-companion main().catch maps TurnWatchdogError to exit 124 and emits a structured error JSON.

Heartbeat semantics: an active turn streaming notifications resets the timer indefinitely. Only true silence triggers abort.

Tests: tests/turn-watchdog.test.mjs covers TurnWatchdogError shape (code, exitCode, message, threadId/turnId/watchdogMs metadata, default null fallbacks).

Refs: this addresses a stalled-turn class that downstream wrappers (e.g. /smart-review in https://github.com/anthropics/claude-code workflows) currently mask with external alarm timeouts.
@OJamals OJamals requested a review from a team May 10, 2026 11:20
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 186c9adbf1

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +646 to 647
armWatchdog(state);
const response = await startRequest();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Race the watchdog against the start request

When turn/start itself stops responding after the timer is armed, this only rejects state.completion; captureTurn is still blocked on await startRequest() and never observes that rejection, so the watchdog either leaves the call hung or surfaces as an unhandled rejection instead of returning the structured exit-124 error. This affects the same stalled JSON-RPC connection scenario before the start response is delivered; the timer needs to be raced with startRequest() or otherwise abort the pending request path too.

Useful? React with 👍 / 👎.

Comment on lines +1083 to +1086
watchdogMs:
typeof options.watchdogMs === "number"
? options.watchdogMs
: Number(process.env.CODEX_TURN_WATCHDOG_MS) || null
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Apply the watchdog to native review turns

The env-backed watchdog is only forwarded from runAppServerTurn, but the default review path calls runAppServerReview, which also awaits captureTurn and still passes no watchdogMs. As a result, setting CODEX_TURN_WATCHDOG_MS does not protect native /codex:review runs from the same silent JSON-RPC stall, even though that path uses the same capture layer; forward the option/env value from runAppServerReview as well.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants